Zhibo Yang

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

May 28, 2026
Via arXiv

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

May 21, 2026
Via arXiv

Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy

May 11, 2026
Via arXiv

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

May 05, 2026
Via arXiv

Triviality Corrected Endogenous Reward

Apr 13, 2026
Via arXiv

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Mar 18, 2026
Via arXiv

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Mar 11, 2026
Via arXiv

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Mar 04, 2026
Via arXiv

UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

Feb 03, 2026
Via arXiv

BabyVision: Visual Reasoning Beyond Language

Jan 10, 2026
Via arXiv